Web Page Classification using Iterative Cross-Training Algorithm

نویسندگان

  • Nuanwan Soonthornphisaj
  • Boonserm Kijsirikul
چکیده

The paper presents a generalization of Iterative Cross-Training algorithm (ICT) which was previously applied to Thai Web pages identification [1]. The main concept of ICT is to iteratively train two sub-classifiers by using unlabeled examples in crossing manner. In this paper, we extend the algorithm in order to classify Web pages into course or non-course ones, which is a more challenging problem. We compare ICT against Supervised Naïve Bayes classifier and Co-Training classifier. The experimental results show that ICT still preserves its robustness on this new problem’s characteristic.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

متن کامل

Combining ILP with Semi-supervised Learning for Web Page Categorization

This paper presents a semi-supervised learning algorithm called Iterative-Cross Training (ICT) to solve the Web pages classification problems. We apply Inductive logic programming (ILP) as a strong learner in ICT. The objective of this research is to evaluate the potential of the strong learner in order to boost the performance of the weak learner of ICT. We compare the result with the supervis...

متن کامل

A Cross Training Corrective Approach for Web Pages Classification

Textual document classification is one challenging area of data mining. Web page classification is a type of textual document classification. However, the text contained in web pages is not homogenous since a web page can discuss related but different subjects. Thus, results obtained by a textual classifier on web pages are not as better as those obtained on textual documents. Therefore, we nee...

متن کامل

Efficient Prediction of Cross-Site Scripting Web Pages using Extreme Learning Machine

Malicious code is a way of attempting to acquire sensitive information by sending malicious code to the trustworthy entity in an electronic communication. JavaScript is the most frequently used command language in the web page environment. If the hackers misuse the JavaScript code there is a possibility of stealing the authentication and confidential information about an organization and user. ...

متن کامل

Iterative cross-training: An algorithm for learning from unlabeled Web pages

The paper presents a learning method, called Iterative Cross-Training (ICT) , for classifying Web pages in two classification problems, i.e., (1) classification of Thai/non-Thai Web pages, and (2) classification of course/non-course home pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that are able to effectively use unlabeled examples to iterat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000